Currently, I'm training an agent on the MountainCarContinuous task using two consecutive images as input and the PPO algorithm. However, I'm encountering an issue where the controller output frequently saturates at either -1 or 1. While this achieves a mean reward of approximately +94 (with an episode length of around 70), it cannot reach the optimal reward of +96 (with an episode length of around 300) that is achievable with state-based input.
Is PPO appropriate for this particular case, or should I try other algorithms instead?

I've included my training code, implemented with Stable-Baselines3, below. Are there any problems with these hyperparameters, or do any of them seem unusual?
```python
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.callbacks import (
    EvalCallback,
    CheckpointCallback,
    CallbackList,
)

num_envs = 8
env = make_vec_env("VisionMountainCarContinuous-v0", n_envs=num_envs)
env = VecFrameStack(env, n_stack=2)

policy_kwargs = dict(
    log_std_init=-0.0,
    ortho_init=False,
    activation_fn=nn.GELU,
    net_arch=dict(pi=[256], vf=[256]),
)

model = PPO(
    'CnnPolicy',
    env,
    n_steps=512,
    n_epochs=10,
    batch_size=128,
    learning_rate=linear_schedule(1e-4),
    clip_range=0.2,
    vf_coef=0.5,
    ent_coef=0.0,
    max_grad_norm=0.5,
    gae_lambda=0.95,
    gamma=0.99,
    use_sde=True,
    sde_sample_freq=4,
    policy_kwargs=policy_kwargs,
    verbose=1,
    seed=533,
)

eval_callback = EvalCallback(
    env,
    best_model_save_path='./logs/',
    log_path='./logs/',
    eval_freq=10000,
    n_eval_episodes=20,
    deterministic=True,
    render=False,
)
checkpoint_callback = CheckpointCallback(
    save_freq=5000,
    save_path='./logs/',
    name_prefix='ppo_cartpole_seed_533',
)
callback = CallbackList([eval_callback, checkpoint_callback])

model.learn(total_timesteps=400000, log_interval=10, callback=callback)
```
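For reference, `linear_schedule` is not part of Stable-Baselines3's public API; I assume it is the usual helper (as defined in RL Baselines3 Zoo) that decays the learning rate linearly from its initial value to zero over training. A minimal sketch:

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Return a schedule that decays linearly from initial_value to 0.

    SB3 calls the schedule with progress_remaining, which goes
    from 1.0 (start of training) down to 0.0 (end of training).
    """
    def func(progress_remaining: float) -> float:
        return progress_remaining * initial_value

    return func


# Example: with initial_value=1e-4, the learning rate is 1e-4 at the
# start of training and 5e-5 at the halfway point.
schedule = linear_schedule(1e-4)
```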
submitted by /u/cfybasil